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Abstract 

We  propose  a  novel  method  of  dimensionality  reduction  for  supervised  learning  problems. 
Given  a  regression  or  classification  problem  in  which  we  wish  to  predict  a  response  variable 
Y  from  an  explanatory  variable  X,  we  treat  the  problem  of  dimensionality  reduction  as 
that  of  finding  a  low-dimensional  “effective  subspace”  of  X  which  retains  the  statistical 
relationship  between  X  and  Y .  We  show  that  this  problem  can  be  formulated  in  terms 
of  conditional  independence.  To  turn  this  formulation  into  an  optimization  problem  we 
establish  a  general  nonparametric  characterization  of  conditional  independence  using  co- 
variance  operators  on  a  reproducing  kernel  Hilbert  space.  This  characterization  allows  us 
to  derive  a  contrast  function  for  estimation  of  the  effective  subspace.  Unlike  many  conven¬ 
tional  methods  for  dimensionality  reduction  in  supervised  learning,  the  proposed  method 
requires  neither  assumptions  on  the  marginal  distribution  of  X ,  nor  a  parametric  model  of 
the  conditional  distribution  of  Y.  We  present  experiments  that  compare  the  performance 
of  the  method  with  conventional  methods. 

1.  Introduction 

Many  statistical  learning  problems  involve  some  form  of  dimensionality  reduction,  either 
explicitly  or  implicitly.  The  goal  may  be  one  of  feature  selection ,  in  which  we  aim  to  find 
linear  or  nonlinear  combinations  of  the  original  set  of  variables,  or  one  of  variable  selection , 
in  which  we  wish  to  select  a  subset  of  variables  from  the  original  set.  The  setting  may  be 
unsupervised  learning,  in  which  a  set  of  observations  of  a  random  vector  X  are  available,  or 
supervised  learning,  in  which  desired  responses  or  labels  Y  are  also  available.  Developing 
methods  for  dimensionality  reduction  requires  being  clear  on  the  goal  and  the  setting,  as 
methods  developed  for  one  combination  of  goal  and  setting  are  not  generally  appropriate  for 
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another.  There  are  additional  motivations  for  dimensionality  reduction  that  it  is  also  helpful 
to  specify,  including:  providing  a  simplified  explanation  of  a  phenomenon  for  a  human 
(possibly  as  part  of  a  visualization  algorithm),  suppressing  noise  so  as  to  make  a  better 
prediction  or  decision,  or  reducing  the  computational  burden.  These  various  motivations 
are  often  complementary. 

In  this  paper  we  study  dimensionality  reduction  in  the  setting  of  supervised  learning. 
Thus,  we  consider  problems  in  which  our  data  consist  of  observations  of  ( X ,  Y)  pairs,  where 
X  is  an  m-dimensional  explanatory  variable  and  where  Y  is  an  ^-dimensional  response.  The 
variable  Y  may  be  either  continuous  or  discrete.  We  refer  to  these  problems  generically  as 
“regression”  problems,  which  indicates  our  focus  on  the  conditional  probability  density 
function  py \x{y  I  x)-  In  particular,  our  framework  includes  discriminative  approaches  to 
classification  problems,  where  Y  is  a  discrete  label. 

We  wish  to  solve  a  problem  of  feature  selection  in  which  the  features  are  linear  combi¬ 
nations  of  the  components  of  X.  In  particular,  we  assume  that  there  is  an  r-dinrensional 
subspace  S  C  Mm  such  that 


py\x(v  I  x)  =  py\nsx(y  I  n5x),  (1) 

for  all  x  and  y.  where  II5  is  the  orthogonal  projection  of  Km  onto  S.  The  subspace  S  is 
called  the  effective  subspace  for  regression.  Based  on  a  set  of  observations  of  ( X ,  Y)  pairs, 
we  wish  to  recover  a  matrix  whose  columns  span  the  effective  subspace. 

We  approach  the  problem  as  a  semiparametric  statistical  problem;  in  particular,  we  make 
no  assumptions  regarding  the  conditional  distribution  py\usx(y  I  II^x),  nor  do  we  make 
any  assumptions  regarding  the  marginal  distribution  px(x).  That  is,  we  wish  to  estimate  a 
finite-dimensional  parameter  (a  matrix  whose  columns  span  the  effective  subspace),  while 
treating  the  distributions  Py\nsx{y  I  n^x)  and  px{x)  nonparametrically. 

Having  found  an  effective  subspace,  we  may  then  proceed  to  build  a  parametric  or 
nonparametric  regression  model  on  that  subspace.  Thus  our  approach  is  an  explicit  dimen¬ 
sionality  reduction  method  for  supervised  learning  that  does  not  require  any  particular  form 
of  regression  model,  and  can  be  used  as  a  preprocessor  for  any  supervised  learner.  This 
can  be  compared  to  the  use  of  methods  such  as  principal  components  analysis  (PCA)  in 
regression,  which  also  make  no  assumption  regarding  the  subsequent  regression  model,  but 
fail  to  make  use  of  the  response  variable  Y . 

There  are  a  variety  of  related  approaches  in  the  literature,  but  most  of  them  involve 
making  specific  assumptions  regarding  the  conditional  distribution  Py\nsx{y  I  n^x),  the 
marginal  distribution  px(x),  or  both.  For  example,  classical  two-layer  neural  networks 
involve  a  linear  transformation  in  the  first  “layer,”  followed  by  a  specific  nonlinear  function 
and  a  second  layer  (Bishop,  1995).  Thus,  neural  networks  can  be  seen  as  attempting  to 
estimate  an  effective  subspace  based  on  specific  assumptions  about  the  regressor  py\nsx{y  I 
nsx).  Similar  comments  apply  to  projection  pursuit  regression  (Friedman  and  Stuetzle, 
1981),  ACE  (Breiman  and  Friedman,  1985)  and  additive  models  (Hastie  and  Tibshirani, 
1986),  all  of  which  provide  a  methodology  for  dimensionality  reduction  in  which  an  additive 
model  E[Y  \  X]  =  gi(0[ X)  +  •  •  •  +  gxiP^X)  is  assumed  for  the  regressor. 

Canonical  correlation  analysis  (CCA)  and  partial  least  squares  (PLS,  Hoskuldsson,  1988, 
Helland,  1988)  are  classical  multivariate  statistical  methods  that  can  be  used  for  dimension¬ 
ality  reduction  in  regression  (Fung  et  al.,  2002,  Nguyen  and  Rocke,  2002).  These  methods 
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are  based  on  a  linearity  assumption  for  the  regressor,  however,  and  thus  are  quite  strongly 
parametric. 

The  line  of  research  that  is  closest  to  our  work  has  its  origin  in  a  technique  known 
as  sliced  inverse  regression  (SIR,  Li,  1991).  SIR  is  a  semiparametric  method  for  finding 
effective  subspaces  in  regression.  The  basic  idea  is  that  the  range  of  the  response  variable 
Y  is  partitioned  into  a  set  of  “slices,”  and  the  sample  means  of  the  observations  X  are 
computed  within  each  slice.  This  can  be  viewed  as  a  rough  approximation  to  the  inverse 
regression  of  X  on  Y.  For  univariate  Y  the  method  is  particularly  easy  to  implement.  Noting 
that  the  inverse  regression  must  lie  in  the  effective  subspace  if  the  forward  regression  lies 
in  such  a  subspace,  principal  component  analysis  is  then  used  on  the  sample  means  to  find 
the  effective  subspace.  Li  (1991)  has  shown  that  this  approach  can  find  effective  subspaces, 
but  only  under  strong  assumptions  on  the  marginal  distribution  px{x) — in  particular,  the 
marginal  distribution  must  be  elliptically  symmetric. 

Further  developments  in  the  wake  of  SIR  include  principal  Hessian  directions  (pHd,  Li, 
1992),  and  sliced  average  variance  estimation  (SAVE,  Cook  and  Weisberg,  1991,  Cook  and 
Yin,  2001).  These  are  all  semiparametric  methods  in  that  they  make  no  assumptions  about 
the  regressor  (see  also  Cook,  1998).  However,  they  again  place  strong  restrictions  on  the 
probability  distribution  of  the  explanatory  variables.  If  these  assumptions  do  not  hold, 
there  is  no  guarantee  of  finding  the  effective  subspace. 

There  are  also  related  nonparametric  approaches  that  estimate  the  derivative  of  the 
regressor  to  achieve  dimensionality  reduction,  based  on  the  fact  that  the  derivative  of 
the  conditional  expectation  E[y  \  BTx\  with  respect  to  x  belongs  to  the  effective  sub¬ 
space  (Samarov,  1993,  Hristache  et  al.,  2001).  However,  nonparametric  estimation  of  deriva¬ 
tives  is  quite  challenging  in  high-dinrensional  spaces. 

There  are  also  dimensionality  reduction  methods  with  a  semiparametric  flavor  in  the  area 
of  classification,  notably  the  work  of  Torkkola  (2003),  who  has  proposed  using  nonparametric 
estimation  of  the  mutual  information  between  X  and  Y ,  and  subsequent  maximization  of 
this  estimate  of  mutual  information  with  respect  to  a  matrix  representing  the  effective 
subspace. 

In  this  paper  we  present  a  novel  semiparametric  method  for  dimensionality  reduction 
that  we  refer  to  as  Kernel  Dimensionality  Reduction  (KDR).  KDR  is  based  on  the  esti¬ 
mation  and  optimization  of  a  particular  class  of  operators  on  reproducing  kernel  Hilbert 
spaces  (Aronszajn,  1950).  Although  our  use  of  reproducing  kernel  Hilbert  spaces  is  related 
to  their  role  in  algorithms  such  as  the  support  vector  machine  and  kernel  PCA  (Boser  et  al., 
1992,  Vapnik  et  al.,  1997,  Scholkopf  et  al.,  1998),  where  the  kernel  function  allows  linear  op¬ 
erations  in  function  spaces  to  be  performed  in  a  computationally-efficient  manner,  our  work 
differs  in  that  it  cannot  be  viewed  as  a  “kernelization”  of  an  underlying  linear  algorithm. 
Rather,  we  use  reproducing  kernel  Hilbert  spaces  to  provide  characterizations  of  general 
notions  of  independence,  and  we  use  these  characterizations  to  design  objective  functions 
to  be  optimized.  We  build  on  earlier  work  by  Bach  and  Jordan  (2002a),  who  showed  how 
to  use  reproducing  kernel  Hilbert  spaces  to  characterize  marginal  independence  between 
pairs  of  variables,  and  thereby  design  an  objective  function  for  independent  component 
analysis.  In  the  current  paper,  we  extend  this  line  of  work,  showing  how  to  characterize 
conditional  independence  using  reproducing  kernel  Hilbert  spaces.  We  achieve  this  by  ex- 
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pressing  conditional  independence  in  terms  of  covariance  operators  on  reproducing  kernel 
Hilbert  spaces. 

How  does  conditional  independence  relate  to  our  dimensionality  reduction  problem? 
Recall  that  our  problem  is  to  find  a  projection  of  X  onto  a  subspace  S  such  that  the 
conditional  probability  of  Y  given  X  is  equal  to  the  conditional  probability  of  Y  given  n,gX. 
This  is  equivalent  to  finding  a  projection  which  makes  Y  and  ( I  —  n^) X  conditionally 
independent  given  n^X.  Thus  we  can  turn  the  dimensionality  reduction  problem  into  an 
optimization  problem  by  expressing  it  in  terms  of  covariance  operators. 

In  a  presence  of  a  finite  sample,  we  need  to  estimate  the  covariance  operator  so  as  to 
obtain  a  sampled-based  objective  function  that  we  can  optimize.  We  derive  a  natural  plug¬ 
in  estimate  of  the  covariance  operator,  and  find  that  the  resulting  estimate  is  identical  to 
the  kernel  generalized  variance  that  has  been  described  earlier  by  Bach  and  Jordan  (2002a) 
in  the  setting  of  independent  component  analysis.  In  that  setting,  the  goal  is  to  measure 
departures  from  independence,  and  the  minimization  of  the  kernel  generalized  variance  can 
be  viewed  as  a  surrogate  for  minimizing  a  certain  mutual  information.  In  the  dimensionality 
reduction  setting,  on  the  other  hand,  the  goal  is  to  measure  conditional  independence,  and 
minimizing  the  kernel  generalized  variance  can  be  viewed  as  a  surrogate  for  maximizing 
a  certain  mutual  information.  Not  surprisingly,  the  derivation  that  leads  to  the  kernel 
generalized  variance  that  we  present  here  is  quite  different  from  the  one  presented  in  the 
earlier  work  on  kernel  ICA.  Moreover,  the  argument  that  we  present  here  can  be  viewed 
as  providing  a  rigorous  foundation  for  other,  more  heuristic,  ways  in  which  the  kernel 
generalized  variance  has  been  used,  including  the  model  selection  algorithms  for  graphical 
models  presented  by  Bach  and  Jordan  (2003). 

The  paper  is  organized  as  follows.  In  Section  2,  we  introduce  the  problem  of  dimen¬ 
sionality  reduction  for  supervised  learning,  and  describe  its  relation  with  conditional  inde¬ 
pendence  and  mutual  information.  Section  3  derives  the  objective  function  for  estimation 
of  the  effective  subspace  for  regression,  and  describes  the  KDR  method.  All  of  the  math¬ 
ematical  details  needed  for  the  results  in  Section  3  are  presented  in  the  Appendix,  which 
also  provides  a  general  introduction  to  covariance  operators  in  reproducing  kernel  Hilbert 
spaces.  In  Section  4,  we  present  a  series  of  experiments  that  test  the  effectiveness  of  our 
method,  comparing  it  with  several  conventional  methods.  Section  5  describes  an  extension 
of  KDR  to  the  problem  of  variable  selection.  Section  6  presents  our  conclusions. 

2.  Dimensionality  reduction  for  regression 

We  consider  a  regression  problem,  in  which  Y  is  an  ^-dimensional  random  vector,  and  X  is 
an  m-dimensional  explanatory  variable.  (Note  again  that  we  use  “regression”  in  a  generic 
sense  that  includes  both  continuous  and  discrete  Y).  The  probability  density  function  of  Y 
given  X  is  denoted  by  py\x{y  I  x )•  Assume  that  there  is  an  r-dimensional  subspace  S  C 
such  that 

pY\x(y  I  x)  =  pY\nsx(y  I  nsx),  (2) 

for  all  x  and  y.  where  is  the  orthogonal  projection  of  Mm  onto  S.  The  subspace  S  is 
called  the  effective  subspace  for  regression. 

The  problem  that  we  treat  here  is  that  of  finding  the  subspace  S  given  an  i.i.d.  sample 
{(X\ .  Yj), . . . ,  (Xn,  Yn)}  from  px  and  py\x ■  The  crux  of  the  problem  is  that  we  assume  no  a 
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priori  knowledge  of  the  regressor,  and  place  no  assumptions  on  the  conditional  probability 
Py\x- 

As  in  the  simpler  setting  of  principal  component  analysis,  we  make  the  (generally  un¬ 
realistic)  assumption  that  the  dimensionality  r  is  known  and  fixed.  We  discuss  various 
approaches  to  the  estimation  of  the  dimensionality  in  Section  6. 

The  notion  of  effective  subspace  can  be  formulated  in  terms  of  conditional  independence. 
Let  ( B ,  C)  be  the  m-dimensional  orthogonal  matrix  such  that  the  column  vectors  of  B  span 
the  subspace  S,  and  define  U  =  BT X  and  V  =  CT X.  Because  ( B ,  C )  is  an  orthogonal 
matrix,  we  have 

Px(x)  =  puy(u,v),  Px,r(x,y)  =  Puyy(u,v,y),  (3) 

for  the  probability  density  functions.  From  Eq.  (3),  Eq.  (2)  is  equivalent  to 

PY\u,v(y  I  U,v)  =  pY\u{y  I  u).  (4) 

This  shows  that  the  effective  subspace  S  is  the  one  which  makes  Y  and  V  conditionally 
independent  given  U  (see  Figure  1). 

Mutual  information  provides  another  point  of  view  on  the  equivalence  between  con¬ 
ditional  independence  and  the  existence  of  the  effective  subspace.  From  Eq.  (3),  it  is 
straightforward  to  see  that 

I(Y,X)=I(Y,U)  +  Eu[l(Y\U,V\U)],  (5) 

where  I(Z,W)  denotes  the  mutual  information  defined  by 

I(Z,W)  :=  f  [ pz,w(z,w)log  Pz:w^'™\dzdw.  (6) 

J  J  Pz{z)pw(w) 

Because  Eq.  (2)  means  I(Y,X)  =  I(Y,U),  the  effective  subspace  S  is  characterized  as  the 
subspace  which  retains  the  mutual  information  of  X  and  Y  by  the  projection  onto  that 
subspace,  or  equivalently,  which  gives  I(Y\U,V\U)  =  0.  This  is  again  the  conditional 
independence  of  Y  and  V  given  U. 

The  expression  in  Eq.  (5)  can  be  understood  in  terms  of  the  decomposition  of  the 
mutual  information  according  to  a  tree-structured  graphical  model — a  quantity  that  has 
been  termed  the  T-mutual  information  by  Bach  and  Jordan  (2002b).  Considering  the  tree 
Y  —  U  —  V  in  Figure  1(b),  we  have  that  the  T-mutual  information  IT  is  given  by 

IT  =  I(Y,U,V)-I(YU)~I(U,V).  (7) 

This  is  equal  to  the  KL-divergence  between  a  probability  distribution  on  (Y,  U,  V )  and  its 
projection  onto  the  family  of  distributions  that  factor  according  to  the  tree;  that  is,  the  set 
of  distributions  that  verify  Y ALV  \  U.  Using  Eq.  (3),  we  can  easily  see  that  I(Y,U,V )  = 
/(Y,  X)  +  I(U,  V),  and  thus  we  obtain 

IT  =  I(Y,  X)  -  I(Y,  U)  =  Ev [I(Y\U,  V\U)}.  (8) 

Then,  dimensionality  reduction  for  regression  can  be  viewed  as  the  problem  of  minimizing 
the  T-mutual  information  for  the  fixed  tree  structure  in  Figure  1(b). 
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X  V  U 

(a)  (b) 

Figure  1:  Graphical  representation  of  dimensionality  reduction  for  regression.  The  variables 
Y  and  V  are  conditionally  independent  given  U,  where  X  =  (17,  V). 


3.  Kernel  method  for  dimensionality  reduction  in  regression 

In  this  section  we  present  our  kernel-based  method  for  dimensionality  reduction.  We  dis¬ 
cuss  the  basic  definition  and  properties  of  cross-covariance  operators  on  reproducing  kernel 
Hilbert  spaces,  derive  an  objective  function  for  characterizing  conditional  independence  us¬ 
ing  cross-covariance  operators,  and  finally  present  a  sampled-based  objective  function  based 
on  this  characterization. 

3.1  Cross-covariance  operators  on  reproducing  kernel  Hilbert  spaces 

We  use  cross-covariance  operators  on  reproducing  kernel  Hilbert  spaces  to  derive  an  objec¬ 
tive  function  for  dimensionality  reduction.  While  cross-covariance  operators  are  generally 
defined  for  random  variables  in  Banach  spaces  (Vakhania  et  al.,  1987,  Baker,  1973),  the 
theory  is  much  simpler  for  reproducing  kernel  Hilbert  spaces.  We  summarize  only  basic 
mathematical  facts  in  this  subsection,  and  defer  the  details  to  the  Appendix.  Let  (7i,  k )  be 
a  reproducing  kernel  Hilbert  space  of  functions  on  a  set  H  with  a  positive  definite  kernel 
k  :  H  x  H  — >  M.  The  inner  product  of  7i  is  denoted  by  (■,-)rH-  We  consider  only  real  Hilbert 
spaces  for  simplicity.  The  most  important  aspect  of  reproducing  kernel  Hilbert  spaces  is 
the  reproducing  property: 

(/,  k(-,x))fi  =  f(x)  for  all  x  E  H  and  /  e  TL.  (9) 

Throughout  this  paper  we  use  the  Gaussian  kernel 

k(x  i,x2)  =  exp(-||xi  —  x2||2/cr2),  (10) 

which  corresponds  to  a  Hilbert  space  of  smooth  functions. 

Let  (7ii,ki)  and  (?72,/c2)  be  reproducing  kernel  Hilbert  spaces  over  measurable  spaces 
(Hi,  Hi)  and  (fi2,  H2),  respectively,  with  k\  and  fc2  measurable.  For  a  random  vector  ( X ,  Y) 
on  Hi  x  H2,  the  cross-covariance  operator  from  l~i\  to  H2  is  defined  by  the  relation 

(g,  YYXf)n2  =  EXY[f{X)g{Y)\  -  Ex[f(X)]EY[g(Y)}  (11) 
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for  all  /  E  7ii  and  g  E  Ti-i-  Eq.  (11)  implies  that  the  covariance  of  f(X)  and  g(Y )  is  given 
by  the  action  of  the  linear  operator  Eyy  and  the  inner  product.  (See  the  Appendix  for  a 
basic  exposition  of  cross-covariance  operators.) 

Covariance  operators  provide  a  useful  framework  for  discussing  conditional  probability 
and  conditional  independence.  As  we  show  in  Corollary  3  of  the  Appendix,  the  following 
relation  holds  between  the  conditional  expectation  and  the  cross-covariance  operator,  given 
that  Eyy  is  invertible:1 

ey\x[9(y)  \  x]  =  'sx1XTixy9  for  all  g  G  TI2,  (12) 

Eq.  (12)  can  be  understood  by  analogy  to  the  conditional  expectation  of  Gaussian  random 
variables.  If  X  and  Y  are  Gaussian  random  variables,  it  is  well  known  that  the  conditional 
expectation  is  given  by 


Ey\x[(JTY  \  X  =  x\  =  xTT,x1xT,XYa,  (13) 

for  an  arbitrary  vector  a,  where  Y,Xx  and  E xy  are  the  variance-covariance  matrices  in  the 
ordinary  sense. 

3.2  Conditional  covariance  operators  and  conditional  independence 

We  derive  an  objective  function  for  characterizing  conditional  independence  using  cross¬ 
covariance  operators.  Suppose  we  have  random  variables  X  and  Y  on  Rm  and  R  ,  respec¬ 
tively.  The  variable  X  is  decomposed  into  U  G  Rr  and  V  G  Rm~r  so  that  X  =  (U,  V). 
For  the  function  spaces  corresponding  to  Y ,  U  and  V,  we  consider  the  reproducing  kernel 
Hilbert  spaces  (Hi,ki),  (^2,^2),  and  (74,3,  fc.3)  on  R£,  Rr,  and  Rm_r,  respectively,  each  en¬ 
dowed  with  Gaussian  kernels.  We  define  the  conditional  covariance  operator  Eyyij/  011 
by 

Eyy|t/  '■=  Syy  -  Yyu^uu^uy,  (14) 

where  Eyy ,  E jju,  E yu  are  the  corresponding  covariance  operators.  As  shown  by  Proposi¬ 
tion  5  in  the  Appendix,  the  operator  Eyyju  captures  the  conditional  variance  of  a  random 
variable  in  the  following  way 

(g,^YY\u9)ni  =  Eu  [Vary \u [g(Y)  \  U]],  (15) 

where  g  is  an  arbitrary  function  in  Tt\.  As  in  the  case  of  Eq.  (13),  we  can  make  an  analogy 
to  Gaussian  variables.  In  particular,  Eqs.  (14)  and  (15)  can  be  viewed  as  the  analogs  of  the 
following  well-known  equality  for  the  conditional  variance  of  Gaussian  variables: 

Var[aTy  |  U]  =  aT (Eyy  —  Ey[/Ejy^-S[/y)a.  (16) 

It  is  natural  to  use  minimization  of  T,yy\u  as  a  basis  of  a  method  for  finding  the  most 
informative  direction  U.  This  intuition  is  justified  theoretically  by  Theorem  7  in  Appendix. 
That  theorem  shows  that 


Eyy|u  >  Eyy|x  for  any  U, 
1.  Even  if  Sxx  is  not  invertible,  a  similar  fact  holds.  See  Corollary  3. 


(17) 
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and 

£ yy\u  -  s yy\x  =  O  Y A.V  |  U,  (18) 

where,  in  Eq.  (17),  the  inequality  should  be  understood  as  the  partial  order  of  self-adjoint 
operators.  From  these  relations,  the  effective  subspace  S  can  be  characterized  in  terms  of 
the  solution  to  the  following  minimization  problem: 


min  E 

s 


YY\m 


subject  to  U  = 


(19) 


In  the  following  section  we  show  how  to  turn  this  population-based  criterion  into  a  sampled- 
based  criterion  that  can  be  optimized  in  the  presence  of  a  finite  sample. 


3.3  Kernel  generalized  variance  for  dimensionality  reduction 

To  derive  a  sampled-based  objective  function  from  Eq.  (19),  we  have  to  estimate  the  con¬ 
ditional  covariance  operator  with  given  data,  and  choose  a  specific  way  to  evaluate  the  size 
of  self-adjoint  operators. 

For  the  estimation  of  the  operator,  we  follow  the  procedure  described  by  Bach  and  Jor¬ 
dan  (2002a)  in  their  derivation  of  kernel  ICA.  Let  Ky  be  the  centralized  Gram  matrix  (Bach 
and  Jordan,  2002a,  Scholkopf  et  al.,  1998),  defined  by 

I<Y  =  {In  ~  £l„l  Tn)GY{ln  ~  ^n) ,  (20) 

where  (Gy)ij  =  Aq(Yj,  Yj)  is  the  Gram  matrix  and  ln  =  (1, . . . ,  1)T  is  the  vector  with  all 
elements  equal  to  1.  The  matrices  Ku  and  I\y  are  defined  similarly,  using  {Lh}f=1  and 
{Vi}^= l>  respectively.  The  empirical  conditional  covariance  matrix  Eyyijy  is  then  defined 

by 

^YY\u  '■=  Syy  -  Ey[/S^S[/y  =  (Ky  +  £ln)2  -  KyKu(Ku  +  £ln)~2KuKy ,  (21) 

where  e  >  0  is  a  regularization  constant. 

The  size  of  Eyyiy  in  the  ordered  set  of  positive  definite  matrices  can  be  evaluated  by 
its  determinant.  Although  there  are  other  choices  for  measuring  the  size  of  Eyyif/,  such 
as  the  trace  and  the  largest  eigenvalue,  we  focus  on  the  determinant  in  this  paper.  Using 
the  Schur  decomposition  det (A  —  BC~1BT)  =  det^ c^) /detC,  the  determinant  of  Syy|£/ 
can  be  written  as  follows: 


det  Eyylf/  — 

det  E [yu\[YU] 
det  Ef/fj 

(22) 

where  E  rymrym  is  defined  by 

y,  /Eyy  Eye j\ 

f  ( Ky  +  eln )2 

kYku  \ 

(23) 

^YUIVV]={£uy  £uv)  = 

V  KjjKy 

(ku  +  eln)2) 

We  symmetrize  the  objective  function  by  dividing  by  the  constant  detEyy,  which  yields 
the  following  objective  function 

det  t[YU][yu]  ^ 

det  Syy  det  E uu 
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We  refer  to  the  problem  of  minimizing  this  function  with  respect  to  the  choice  of  subspace 
S  as  Kernel  Dimensionality  Reduction  (KDR). 

Eq.  (24)  has  been  termed  the  “kernel  generalized  variance”  by  Bach  and  Jordan  (2002a), 
who  used  it  as  a  contrast  function  for  independent  component  analysis.  In  that  setting,  the 
goal  is  to  minimize  a  mutual  information  (among  a  set  of  recovered  “source”  variables), 
in  the  attempt  to  obtain  independent  components.  Bach  and  Jordan  (2002a)  showed  that 
the  kernel  generalized  variance  is  in  fact  an  approximation  of  the  mutual  information  of 
the  recovered  sources,  when  this  mutual  information  is  expanded  around  the  manifold  of 
factorized  distributions.  In  the  current  setting,  on  the  other  hand,  our  goal  is  to  maximize 
the  mutual  information  I (Y,  U),  and  we  certainly  do  not  expect  to  be  near  a  manifold  in 
which  Y  and  U  are  independent.  Thus  the  argument  for  the  kernel  generalized  variance  as 
an  objective  function  in  the  ICA  setting  does  not  apply  here.  What  we  have  provided  in 
the  previous  section  is  an  entirely  distinct  argument  that  shows  that  the  kernel  generalized 
variance  is  in  fact  an  appropriate  objective  function  for  the  dimensionality  reduction  prob¬ 
lem,  and  that  minimizing  the  kernel  generalized  variance  in  Eq.  (24)  can  be  viewed  as  a 
surrogate  for  maximizing  the  mutual  information  I(Y,  U ). 

Given  that  the  numerical  task  that  must  be  solved  in  KDR  is  the  same  as  the  numerical 
task  that  must  be  solved  in  kernel  ICA,  however,  we  can  import  all  of  the  computational 
techniques  developed  by  Bach  and  Jordan  (2002a)  for  minimizing  kernel  generalized  variance 
in  the  KDR  setting.  In  particular,  the  optimization  routine  that  we  use  in  our  experiments 
is  gradient  descent  with  a  line  search,  where  we  exploit  incomplete  Cholesky  decomposition 
to  reduce  the  n  x  n  matrices  required  in  Eq.  (24)  to  low-rank  approximations.  To  cope 
with  local  optima,  we  make  use  of  an  annealing  technique,  in  which  the  scale  parameter  a 
for  the  Gaussian  kernel  is  decreased  gradually  during  the  iterations  of  optimization.  For 
a  larger  a.  the  contrast  function  has  fewer  local  optima,  which  makes  optimization  easier. 
The  search  becomes  more  accurate  as  a  is  decreased. 

4.  Experimental  results 

We  study  the  effectiveness  of  the  new  method  through  experiments,  comparing  it  with 
several  conventional  methods:  SIR,  pHd,  CCA,  and  PLS.  For  the  experiments  with  SIR 
and  pHd,  we  use  an  implementation  for  R  due  to  Weisberg  (2002). 

4.1  Synthetic  data 

The  first  data  sets  A  and  B  comprise  one-dinrensional  Y  and  two-dimensional  X  =  (X\ ,  X2). 
One  hundred  i.i.d.  data  points  are  generated  by 

A:  Y  ~  l/(l+exp(— Xi))  +  Z, 

B:  Y  ~  2exp(-A12) +  Z, 

where  Z  ~  A^(0,0.12),  and  X  =  (X\,X2)  follows  a  normal  distribution  and  a  normal 
mixture  with  two  components  for  A  and  B,  respectively.  The  effective  subspace  is  spanned 
by  Bq  =  (1,  0)T  in  both  cases.  The  data  sets  are  depicted  in  Figure  2. 

Table  1  shows  the  angles  between  B$  and  the  estimated  direction.  For  Data  A,  all 
the  methods  except  PLS  yield  a  good  estimate  of  Bq.  Data  B  is  surprisingly  difficult  for 
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Figure  2:  Data  A  and  B.  One  dimensional  Y  depends  only  on  X\  in  X  =  (Ad,  X2). 


the  conventional  methods,  presumably  because  the  distribution  of  X  is  not  spherical  and 
the  regressor  has  a  strong  nonlinearity.  The  KDR  method  succeeds  in  finding  the  correct 
direction  for  both  data  sets. 

Data  C  has  300  samples  of  17  dimensional  X  and  one  dimensional  Y,  which  are  generated 
by 

C  :  Y  ~  0.9Xi  +  0.2— — —  +  Z,  (25) 

1  +  A  i7 


where  Z  ~  1V(0,0.012)  and  X  follows  a  uniform  distribution  on  [0, 1] 1 ' .  The  effective 
subspace  is  given  by  &i  =  (1,0,...,  0)  and  62  =  (0, . . .  ,  0, 1).  We  compare  the  KDR  method 
with  SIR  and  pHd  only — CCA  and  PLS  cannot  find  a  2-dinrensional  subspace,  because  Y 
is  one-dinrensional.  To  evaluate  the  accuracy  of  the  results,  we  use  the  multiple  correlation 
coefficient 


R(b )  =  max 


P  YXxb 


^ B  ^(3TYxxi3  •  6T£  xxb 


[b  e  II 


oJ 


(26) 


which  is  used  in  Li  (1991).  As  shown  in  Table  2,  the  KDR  method  outperforms  the  others 
in  finding  the  weak  contribution  of  the  second  direction. 


4.2  Real  data:  classification 

In  this  section  we  apply  the  KDR  method  to  classification  problems.  Many  conventional 
methods  of  dimensionality  reduction  for  regression  are  not  suitable  for  classification.  In 
particular,  in  the  case  of  SIR,  the  dimensionality  of  the  effective  subspace  must  be  less  than 
the  number  of  classes,  because  SIR  uses  the  average  of  X  in  slices  along  the  variable  Y . 
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SIR 

pHd 

CCA 

PLS 

Kernel 

A:  angle  (rad.) 

0.0087 

-0.1971 

0.0099 

0.2736 

-0.0014 

B:  angle  (rad.) 

-1.5101 

-0.9951 

-0.1818 

0.4554 

0.0052 

Table  1:  Angles  between  the  true  and  the  estimated  spaces  for  Data  A  and  B. 


SIR(10) 

SIR(15) 

SIR(20) 

SIR(25) 

pHd 

Kernel 

R(bi) 

0.987 

0.993 

0.988 

0.990 

0.110 

0.999 

R(b2) 

0.421 

0.705 

0.480 

0.526 

0.859 

0.984 

Table  2:  Correlation  coefficients  for  Data  C.  SIR(m)  indicates  the  SIR  with  m  slices. 


Thus,  in  binary  classification,  only  a  one-dinrensional  subspace  can  be  found,  because  at 
most  two  slices  are  available.  The  methods  CCA  and  PLS  have  a  similar  limitation  on  the 
dimensionality  of  the  effective  subspace;  they  cannot  find  a  subspace  of  larger  dimension¬ 
ality  than  that  of  Y .  Thus  our  focus  is  the  comparison  between  KDR  and  pHd,  which  is 
applicable  to  general  binary  classification  problems.  Note  that  Cook  and  Lee  (1999)  dis¬ 
cuss  dimensionality  reduction  methods  for  binary  classification,  and  propose  the  difference 
of  covariance  (DOC)  method.  They  compare  pHd  and  DOC  theoretically,  and  show  that 
these  methods  are  the  same  in  binary  classification  if  the  population  ratio  of  the  classes  is 
1/2,  which  is  almost  the  case  in  our  experiments. 

In  the  first  experiment,  we  show  the  visualization  capability  of  the  dimensionality  reduc¬ 
tion  methods.  We  use  the  Wine  data  set  in  the  UCI  machine  learning  repository  (Murphy 
and  Aha,  1994)  to  see  how  the  projection  onto  a  low-dinrensional  space  realizes  an  effective 
description  of  data.  The  wine  data  consist  of  178  samples  with  13  variables  and  a  label  of 
three  classes.  We  apply  the  KDR  method,  CCA,  PLS,  SIR,  and  pHd  to  these  data.  Figure  3 
shows  the  projection  onto  the  2-dinrensional  subspace  estimated  by  each  method.  The  KDR 
method  separates  the  data  into  three  classes  most  completely,  while  CCA  also  shows  per¬ 
fect  separation.  We  can  see  that  the  data  are  nonlinearly  separable  in  the  two-dimensional 
space.  The  other  methods  do  not  separate  the  classes  completely. 

Next  we  investigate  how  much  information  on  Y  is  preserved  in  the  estimated  subspace. 
After  reducing  the  dimensionality,  we  use  the  support  vector  machine  (SVM)  method  to 
build  a  classifier  in  the  reduced  space,  and  compare  its  accuracy  with  an  SVM  trained 
using  the  full  dimensional  vector  X2.  We  use  the  Heart-disease  data  set  3,  Ionosphere ,  and 
Wisconsin-breast- cancer  from  the  UCI  repository.  A  description  of  these  data  is  presented 
in  Table  3. 

Figure  4  shows  the  classification  rates  for  the  test  set  in  subspaces  of  various  dimen¬ 
sionality.  We  can  see  that  KDR  yields  good  separation  even  in  low-dinrensional  subspaces, 

2.  In  our  experiments  with  the  SVM,  we  used  the  Matlab  Support  Vector  Toolbox  by  S.  Gunn;  see 
http://www.isis.ecs.soton.ac.uk/resources/svminfo. 

3.  We  use  the  Cleveland  data  set,  created  by  Dr.  Robert  Detrano  of  V.A.  Medical  Center,  Long  Beach  and 
Cleveland  Clinic  Foundation.  Although  the  original  data  set  has  five  classes,  we  use  only  “no  presence” 
(0)  and  “presence”  (1-4)  for  the  binary  class  labels.  Samples  with  missing  values  are  removed  in  our 
experiments. 
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Data  set 

dim.  of  X 

training  sample 

test  sample 

Heart-disease 

13 

149 

148 

Ionosphere 

34 

151 

200 

Breast-cancer-Wisconsin 

30 

200 

369 

Table  3:  Data  description  for  the  binary  classification  problem. 


while  pHd  is  much  worse  in  low  dimensions.  It  is  noteworthy  that  in  the  Ionosphere  data  set 
the  classifier  in  dimensions  5,  10,  and  20  outperforms  the  classifier  in  the  full  dimensional 
space.  This  is  presumably  due  to  the  suppression  of  noise  irrelevant  to  the  prediction  of  Y . 
These  results  show  that  the  kernel  method  successfully  finds  an  effective  subspace  which 
preserves  the  class  information  even  when  the  dimensionality  is  reduced  significantly. 

5.  Extension  to  variable  selection 

In  this  section,  we  describe  an  extension  of  the  KDR  method  to  the  problem  of  variable 
selection.  Variable  selection  is  different  from  dimensionality  reduction;  the  former  involves 
selecting  a  subset  of  the  explanatory  variables  {X\, . . .  ,  Xm}  in  order  to  obtain  a  simplified 
prediction  of  Y  from  X,  while  the  latter  involves  finding  linear  combinations  of  the  variables. 
However,  the  objective  function  that  we  have  presented  for  dimensionality  reduction  can 
be  extended  straightforwardly  to  variable  selection.  In  particular,  given  a  fixed  number  of 
variables  to  be  selected,  we  can  compare  the  KGV  for  subspaces  spanned  by  combinations  of 
this  number  of  selected  variables.  This  gives  a  reasonable  way  to  select  variables,  because  for 
a  subset  W  =  {XJ1 , . . . ,  Xjr}  C  {X\ , . . .  ,Xm},  the  variables  Y  and  Wc  are  conditionally 
independent  given  W  if  and  only  if  Y  and  Tl\ycX  are  conditionally  independent  given  UyyX, 
where  H\y  and  n^c  are  the  orthogonal  projections  onto  the  subspaces  spanned  by  W  and 
W  ,  respectively.  If  we  try  to  select  r  variables  from  among  m  explanatory  variables,  the 
total  number  of  evaluations  is  (™) . 

When  (mj  is  large,  we  must  address  the  computational  cost  that  arises  in  comparing 
large  numbers  of  subsets.  As  in  most  other  approaches  to  variable  selection  (see,  e.g.,  Guyon 
and  Elisseeff,  2003),  we  propose  the  use  of  a  greedy  algorithm  and  random  search  for  this 
combinatorial  aspect  of  the  problem.  (In  the  experiments  presented  in  the  current  paper, 
however,  we  confine  ourselves  to  small  problems  in  which  all  combinations  are  tractably 
evaluated) . 

We  apply  this  kernel-based  method  of  variable  selection  to  the  Boston  Housing  data 
(Harrison  and  Rubinfeld,  1978)  and  the  Ozone  data  (Breiman  and  Friedman,  1985),  which 
have  been  often  used  as  testbed  examples  for  variable  selection.  Tables  4  and  5  give  the 
detailed  description  of  the  data  sets.  There  are  506  samples  in  the  Boston  Housing  data,  for 
which  the  variable  MV,  the  median  value  of  house  prices  in  a  tract,  is  estimated  by  using 
the  13  other  variables.  We  use  the  corrected  version  of  the  data  set  given  by  Gilley  and 
Pace  (1996).  In  the  Ozone  data  in  which  there  are  330  samples,  the  variable  UP03  (the 
ozone  concentration)  is  to  be  predicted  by  9  other  variables. 

Table  6  shows  the  best  three  sets  of  four  variables  that  attain  the  smallest  values  of  the 
kernel  generalized  variance.  For  the  Boston  Housing  data,  RM  and  LSTAT  are  included 
in  all  the  three  of  the  result  sets  in  Table  6,  and  PTRATIO  and  TAX  are  included  in  two 
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(a)  Heart-disease 


(b)  Ionosphere 


(c)  Wisconsin  Breast  Cancer 
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Figure  4:  Classification  accuracy  of  the  SVM  for  test  data  after  dimensionality  reduction. 
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Variable 

MV 

CRIM 

ZN 

INDUS 

CHAS 

NOX 

RM 

AGE 

DIS 

RAD 

TAX 

PTRATIO 

B 

LSTAT 


Description 

median  value  of  owner-occupied  home 
crime  rate  by  town 

proportion  of  town’s  residential  land  zoned  for  lots 
greater  than  25,000  square  feet 
proportion  of  nonretail  business  acres  per  town 
Charles  River  dummy 

(=  1  if  tract  bounds  the  Charles  River,  0  otherwise) 
nitrogen  oxide  concentration  in  pphrn 
average  number  of  rooms  in  owner  units 
proportion  of  owner  units  build  prior  to  1940 
weighted  distances  to  five  employment  centers 
in  the  Boston  region 
index  of  accessibility  to  radial  highways 
full  property  tax  rate  ($/$10,000) 
pupil-teacher  ratio  by  town  school  district 
black  proportion  of  population 
proportion  of  population  that  is  lower  status 

Table  4:  Boston  Housing  Data 


of  them.  This  observation  agrees  well  with  the  analysis  using  alternating  conditional  ex¬ 
pectation  (ACE)  by  Breiman  and  Friedman  (1985),  which  gives  RM,  LSTAT,  PTRATIO, 
and  TAX  as  the  four  major  contributors.  The  original  motivation  in  the  study  was  to  in¬ 
vestigate  the  influence  of  nitrogen  oxide  concentration  (NOX)  on  the  house  price  (Harrison 
and  Rubinfeld,  1978).  In  accordance  with  the  previous  studies,  our  analysis  shows  a  rel¬ 
atively  small  contribution  of  NOX.  For  the  Ozone  data,  all  three  of  the  result  sets  in  the 
variable  selection  method  include  HMDT,  SBTP,  and  IBHT.  The  variables  IBTP,  DGPG, 
and  VDHT  are  chosen  in  one  of  the  sets.  This  shows  a  fair  accordance  with  earlier  results 
by  Breiman  and  Friedman  (1985)  and  Li  et  al.  (2000);  the  former  concludes  by  ACE  that 
SBTP,  IBHT,  DGPG,  and  VSTY  are  the  most  influential,  and  the  latter  selects  HMDT, 
IBHT,  and  DGPG  using  a  pHd-based  method. 

6.  Conclusion 

We  have  presented  KDR,  a  new  kernel-based  approach  to  dimensionality  reduction  for  re¬ 
gression  and  classification.  One  of  the  most  notable  aspects  of  this  method  is  its  generality — 
we  do  not  impose  any  strong  assumptions  on  either  the  conditional  or  the  marginal  distri¬ 
bution.  This  allows  the  method  to  be  applicable  to  a  wide  range  of  problems,  and  gives  it 
a  significant  practical  advantage  over  existing  methods  such  as  CCA,  PPR,  SIR,  pHd,  and 
so  on.  These  methods  all  impose  significant  restrictions  on  the  conditional  probability,  the 
marginal  distribution,  or  the  dimensionality  of  the  effective  subspaces. 

Our  experiments  have  shown  that  the  KDR  method  can  provide  many  of  the  desired 
effects  of  dimensionality  reduction:  it  provides  data  visualization  capabilities,  it  can  suc¬ 
cessfully  select  important  explanatory  variables  in  regression,  and  it  can  yield  classification 
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Variable 

Description 

UP03 

-  upland  ozone  concentratin  (ppnr) 

VDHT 

—  Vandenburg  500  millibar  height  (nr) 

HMDT 

—  hinridity  (percent) 

IBHT 

—  inversion  base  height  (ft.) 

DGPG 

-  Daggett  pressure  gradient  (mmhg) 

IBTP 

—  inversion  base  temperature  (°F) 

SBTP 

—  Sandburg  Air  Force  Base  temperature  (°C) 

VSTY 

—  visibility  (miles) 

WDSP 

—  wind  speed  (nrph) 

DAY 

—  day  of  the  year 

Table  5:  Ozone  data 


Boston 

1st 

2nd 

3rd 

CRIM 

X 

ZN 

INDUS 

CHAS 

Ozone 

1st 

2nd 

3rd 

NOX 

VDHT 

X 

RM 

X 

X 

X 

HMDT 

X 

X 

X 

AGE 

IBHT 

X 

X 

X 

DIS 

X 

DGPG 

X 

RAD 

IBTP 

X 

TAX 

X 

X 

SBTP 

X 

X 

X 

PTRATIO 

X 

X 

VSTY 

B 

WDSP 

LSTAT 

X 

X 

X 

DAY 

KGV 

.1768 

.1770 

.1815 

KGV 

.2727 

.2736 

.2758 

Table  6:  Variable  selection  using  the  proposed  kernel  method. 
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performance  that  is  better  than  the  performance  achieved  with  the  full-dimensional  covari¬ 
ate  space.  We  have  also  discussed  the  extension  of  the  KDR  method  to  variable  selection. 
Experiments  with  classical  data  sets  has  shown  an  accordance  with  the  previous  results  on 
these  data  sets  and  suggest  that  further  study  of  this  application  of  KDR  is  warranted. 

The  theoretical  basis  of  KDR  lies  in  the  nonparametric  characterization  of  conditional 
independence  that  we  have  presented  in  this  paper.  Extending  earlier  work  on  the  kernel- 
based  characterization  of  independence  in  ICA  (Bach  and  Jordan,  2002a),  we  have  shown 
that  conditional  independence  can  be  characterized  in  terms  of  covariance  operators  on  a 
reproducing  kernel  Hilbert  space.  While  our  focus  has  been  on  the  problem  of  dimensionality 
reduction,  it  is  also  worth  noting  that  there  are  many  other  possible  applications  of  this 
characterization.  In  particular,  conditional  independence  plays  an  important  role  in  the 
structural  definition  of  probabilistic  graphical  models,  and  our  results  may  have  applications 
to  model  selection  and  inference  in  graphical  models. 

There  are  several  statistical  problems  which  need  to  be  addressed  in  further  research  on 
KDR.  First,  a  basic  analysis  of  the  statistical  consistency  of  the  KDR-based  estimator — the 
convergence  of  the  estimator  to  the  true  subspace  when  such  a  space  really  exists — is  needed. 
Second,  and  most  significantly,  we  need  rigorous  methods  for  choosing  the  dimensionality 
of  the  effective  subspace.  If  the  goal  is  that  of  achieving  high  predictive  performance  after 
dimensionality  reduction,  we  can  use  one  of  many  existing  methods  (e.g.,  cross-validation, 
penalty-based  methods)  to  assess  the  expected  generalization  as  a  function  of  dimensionality. 
Note  in  particular  that  by  using  KDR  as  a  method  to  select  an  estimator  given  a  fixed 
dimensionality,  we  have  substantially  reduced  the  number  of  hypotheses  being  considered, 
and  expect  to  find  ourselves  in  a  regime  in  which  methods  such  as  cross-validation  are 
likely  to  be  effective.  It  is  also  worth  noting,  however,  that  the  goals  of  dimensionality 
reduction  are  not  always  simply  that  of  prediction;  in  particular,  the  search  for  small  sets 
of  explanatory  variables  will  need  to  be  guided  by  other  principles.  Finally,  asymptotic 
analysis  may  provide  useful  guidance  for  selecting  the  dimensionality;  an  example  of  such 
an  analysis  that  we  believe  can  be  adopted  for  KDR  has  been  presented  by  Li  (1991)  for 
the  SIR  method. 
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Appendix  A.  Cross-covariance  operators  on  reproducing  kernel  Hilbert 
spaces  and  independence  of  random  variables 

A.l  Cross-covariance  operators 

While  cross-covariance  operators  are  generally  defined  for  random  variables  on  Banach 
spaces  Vakhania  et  al.  (1987),  Baker  (1973),  they  are  more  easily  defined  on  reproducing 
kernel  Hilbert  spaces  (RKHS).  In  this  subsection,  we  summarize  some  of  the  basic  math- 
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ematical  facts  used  in  Sections  3.1  and  3.3.  While  we  discuss  only  real  Hilbert  spaces, 
extension  to  the  complex  case  is  straightforward. 

Theorem  1  Let  (Hi,  Hi)  and  (^,£>2)  be  measurable  spaces,  and  let  (TLi,k{)  and  (H2,  k2) 
be  reproducing  kernel  Hilbert  spaces  on  Hi  and  H2,  respectively,  with  k\  and  k2  measur¬ 
able.  Suppose  we  have  a  random  vector  (X,Y)  on  Hi  x  H2  such  that  Ex[k±(X,  X)]  and 
EY[k2(Y,  Y)]  are  finite.  Then,  there  exists  a  unique  operator  Yyx  from  Ti\  to  77 2  such  that 

(g,  EYXf)n2  =  EXY [f{X)g(Y)}  -  Ex[f(X)]EY[g(Y)}  (27) 

holds  for  all  f  £  Ti\  and  g  £  0.2-  This  is  called  the  cross-covariance  operator. 

Proof  Obviously,  the  operator  is  unique,  if  it  exists.  From  Riesz’s  representation  theo¬ 
rem  (see  Reed  and  Simon,  1980,  Theorem  II. 4,  for  example),  the  existence  of  YYXf  £  Hi 
for  a  fixed  /  can  be  proved  by  showing  that  the  right  hand  side  of  Eq.  (27)  is  a  bounded 
linear  functional  on  0.2-  The  linearity  is  obvious,  and  the  boundedness  is  shown  by 

| EXY[f{X)g{Y)}  -  Ex[f(X)]EY[g(Y)}\ 

/leqEXY\(ki(-,X),  f)ni{k2{-,Y),g)n2  \  +  Ex\  (h(-,  X),  f)Hl  \  ■  Ey\(k2(-,Y),  g)n2\ 
<EYy[||fci(-,X)||Wl||/||Wl||fc2(-,T)||W2|bllw2] 

+  Ex\\\k1{-,X)\\nJf\\ni]EY\\\k2{-,Y)\\nJg\\n2] 

<{Ex[^(X,A0]1/2Ey[fc2(T,T)]1/2  +  EY[fci(X,X)1/2]jEy[A;2(y)y)i/2]}||/||Wi||5||W2. 

(28) 

For  the  last  inequality,  ||A:(-,x)||^  =  k(x,x)  is  used.  The  linearity  of  the  map  Eyx  is  given 
by  the  uniqueness  part  of  Riesz’s  representation  theorem.  ■ 


From  Eq.  (28),  Eyx  is  bounded,  and  by  definition,  we  see  YyX  =  Exy,  where  A*  denotes 
the  adjoint  of  A.  If  the  two  RKHS  are  the  same,  the  operator  Yxx  is  called  the  covariance 
operator.  A  covariance  operator  Y>xx  is  bounded,  self-adjoint,  and  trace-class. 

In  an  RKHS,  conditional  expectations  can  be  expressed  by  cross-covariance  operators, 
in  a  manner  analogous  to  finite-dimensional  Gaussian  random  variables. 

Theorem  2  Let  ifH\,k\)  and  (TL2,k2)  be  RKHS  on  measurable  spaces  Hi  and  Ll2,  respec¬ 
tively,  with  k\  and  k2  measurable,  and  (X,Y)  be  a  random  vector  on  Hi  x  H2.  Assume  that 
Ex[k\(X,  X)\  and  EY [^(T,  T)]  are  finite,  and  for  all  g  £  H.2  the  conditional  expectation 
EY\x[g(Y)  |  X  =  ■]  is  an  element  of  Hi.  Then,  we  have  for  all  g  £  TL2 

'S‘XxEy\x[g{Y)  |  X]  =  T,Xyg,  (29) 

where  SYY  and  YXY  are  the  covariance  and  cross-covariance  operator. 

Proof  For  any  /  £  7i.\ ,  we  have 

(f,YxxEY]x[g(Y)  \X])Hl 

=  Ex[f{X)Enx[g{Y)  |  A]]  -  Ex[f(X)]Ex[EYlx[g(Y)  \  A]] 

=  Exy[f(X)g(Y)}  -  Ex[f(X)]EY\g(Y)}  =  (/,  TXYg)Hl. 
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This  completes  the  proof. 


Corollary  3  Let  be  the  right  inverse  ofYXx  on  (KerYXx)±  ■  Under  the  same  as¬ 
sumptions  as  Theorem  2,  we  have 

(/,  txx^XYg)  =  (/,  EY]x[g(Y)  |  X])  (30) 

for  all  f  G  (KerSxx)"1"  and  g  G  H2.  In  particular,  if  KerSxx  =  0,  we  have 

=  EY\ x[g{Y)  I  X],  (31) 

Proof  Note  that  the  product  E^Exy  is  well-defined,  because  RangeExv  C  RangeSxx  = 
(KerSxx)-1--  The  first  inclusion  is  shown  from  the  expression  Exy  =  E^TEyy  with  a 
bounded  operator  V  (Baker,  1973,  Theorem  1),  and  the  second  equation  holds  for  any 
self-adjoint  operator.  Take  /  =  YXxh  G  RangeSxx-  Then,  Theorem  2  yields 

{f,±xxVxY9)  =  {h,Hxxt-xlxTxxEnx[g{Y)  \  X }) 

=  (h,YxxEylx[g(Y)  |  X])  =  (f,EYlx\g(Y)  \  X ]). 


This  completes  the  proof. 


The  assumption  Ey\x[g(Y)  \  X  =  •]  G  Hi  in  Theorem  2  can  be  simplified  so  that  it  can 
be  checked  without  reference  to  a  specific  g. 

Proposition  4  Under  the  condition  of  Theorem  2,  if  there  exists  C  >  0  such  that 

Ey\xMvi,Y)  |  X  =  x1]EY\x[k2(y2,Y)  \  X  =  x2]  <  Cki(xi,x2)k2(yi,y2)  (32) 

for  all  x\,x2  G  Hi  and  yi,y2  G  Ll2,  then  for  all  g  G  H2  the  conditional  expectation 
Ey\x[g(Y)  |  X  =  •]  is  an  element  of  Hi. 

Proof  See  Theorem  2.3.13  in  Alpay  (2001).  ■ 

For  a  function  /  in  an  RKHS,  the  expectation  of  /(X)  can  be  formulated  as  the  inner 
product  of  /  and  a  fixed  element.  Let  (fi,  B)  be  a  measurable  space,  and  {H,  k)  be  an  RKHS 
on  fl  with  k  measurable.  Note  that  for  a  random  variable  X  on  H,  the  linear  functional 
/  i— >  Ex[f{X)\  is  bounded  if  Ex[k(X,  X)]  exists.  By  Riesz’s  theorem,  there  is  u  G  H  such 
that  {u,f)-n  =  Ex[f{X)]  for  all  /  G  H.  If  we  define  Ex[k(-,X)\  €H  by  this  element  u,  we 
formally  obtain  the  equality 

(EX[k(;X)]J)n  =  Ex[(k(;X),f)n],  (33) 

which  looks  like  the  interchangeability  of  the  expectation  by  X  and  the  inner  product. 
While  the  expectation  Ex[k(-,  X)]  can  be  defined,  in  general,  as  an  integral  with  respect  to 
the  distribution  on  H  induced  by  k(-,X),  the  element  Ex[k(-,  X)\  is  formally  obtained  as 
above  in  a  reproducing  kernel  Hilbert  space. 
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A. 2  Conditional  covariance  operator  and  conditional  independence 

We  define  the  conditional  (cross-)covariance  operator,  and  derive  its  relation  with  the  con¬ 
ditional  covariance  of  random  variables.  Let  (TLi,ki),  [TL 2,^2),  let  (^3,^3)  be  RKHS  on 
measurable  spaces  fly  LI2,  and  fl3,  respectively,  and  let  (X.Y.Z)  be  a  random  vector  on 
Hixfi2x  Q3.  The  conditional  cross- covariance  operator  of  (X.Y)  given  Z  is  defined  by 

E) yx\z  ■=  Syx  -  YyzYzz'Zzx-  (34) 

Because  KerE^z  C  KerEyy  from  the  fact  E YZ  =  Syy  VE^J  for  some  bounded  operator 
V  (Baker,  1973,  Theorem  1),  the  operator  EyyE^Eyx  can  be  uniquely  defined,  even  if 
E zz  is  not  unique.  By  abuse  of  notation,  we  write  Ey^E^Exx,  when  cross-covariance 
operators  are  discussed. 

The  conditional  cross-covariance  operator  is  related  to  the  conditional  covariance  of  the 
random  variables. 

Proposition  5  Let  (TLi,k\),  (7^2,  fe),  and  (743,  ki)  be  reproducing  kernel  Hilbert  spaces  on 
measurable  spaces  fly  Ll-2,  and  ^3,  respectively,  with  ki  measurable,  and  let  ( X,Y,Z )  be 
a  measurable  random  vector  on  fli  x  Q2  x  LI3  such  that  Ex[ki(X,  X)\,  Ey [^(K,  Y")],  and 
Ez[k3(Z,  Z)\  are  finite.  It  is  assumed  that  Ex\z[f(X)  \  Z]  and  EY\z[g(Y)  \  Z]  are  elements 
of  7i 3  for  all  f  £  H\  and  g  £  H.2-  Then,  for  all  f  £  Ti\  and  g  £  TL2,  we  have 

(9^YX\zf)n2  =  EXY[f(X)g(Y)}  -  Ez[Ex\z[f(X)  \  Z]EY\z[g{Y)  \  Z]] 

=  Ez[CovXYIz(f(X),g(Y)\Z)].  (35) 

Proof  From  the  decomposition  Ey^  =  EyyFE^J,  we  have  E zYg  £  (KerE zz)±-  Then, 
by  Corollary  3,  we  obtain 

(g^Yzt^Yzxf)  =  (YZYg,t-zlzYZxf)  =  (S  ZYg,  Ex]z[f(X)  \  Z}) 

=  EYZ[g(Y)Exlz[f(X)  |  Z}]  -  Ex[f(X)]EY\g(Y)}. 

From  this  equation,  the  theorem  is  proved  by 

(g,XYX\zf)  =  EXY[f(X)g(Y)}  -  Ex[f(X)]EY[g(Y)} 

-  EYZ[g(Y)Ex\z[f(X)  |  Z}\  +  Ex[f(X)}EY\g(Y)} 

=  EXY[f  (X)g(Y)]  -  EZ  [Ex\z[f{X)  \  Z]Eylz[g(Y)  \  Zj] .  (36) 


The  following  definition  is  important  to  describe  our  main  theorem.  Let  (fl,£>)  be  a 
measurable  space,  let  (H,  k )  be  a  RKHS  over  fl  with  k  measurable  and  bounded,  and  let 
S  be  the  set  of  all  the  probability  measures  on  (fl,£>).  The  RKHS  TL  is  called  probability¬ 
determining,  if  the  map 


S3  P  h-  (/  ~  Ex„P[f(X)])  £  H* 


(37) 
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is  one-to-one,  where  H*  is  the  dual  space  of  H.  From  Riesz’s  theorem,  H  is  probability¬ 
determining  if  and  only  if  the  map 

S3P  ^  Ex~p[k(-,X)]  £  H 

is  one-to-one.  Theorem  2  in  (Bach  and  Jordan,  2002a)  shows  the  following  fact: 

Theorem  6  (Bach  and  Jordan  2002a)  For  an  arbitrary  a  >  0,  the  reproducing  ker¬ 
nel  Hilbert  space  with  Gaussian  kernel  k(x,y)  =  exp(—  ||x  —  y\\2/cr)  on  Rm  is  probability¬ 
determining. 

Recall  that  for  two  RKHS  Hi  and  H2  on  Hi  and  H2,  respectively,  the  direct  product 
is  the  RKHS  on  Hi  x  H2  with  the  positive  definite  kernel  k \ k/2  (see  Aronszajn,  1950). 
The  relation  between  conditional  independence  and  the  conditional  covariance  operator  is 
given  by  the  following  theorem: 

Theorem  7  Let  (Hu,ku),  (^12,^12);  and  (H2,  ^2)  be  reproducing  kernel  Hilbert  spaces 
on  measurable  spaces  Hu,  H12,  and  H2,  respectively,  with  continuous  and  bounded  kernels. 
Let  (X,Y)  =  ( Z,  W,  Y )  be  a  random  vector  on  Hu  x  H12  x  H2,  where  X  =  (Z,W),  and 
let  Hi  =  Hu  ®H\2  be  the  direct  product.  It  is  assumed  that  EY\z[g{X)  \  Z\  £  Hu  and 
EY\x[g{X)  I  X]  £  H\  for  all  g  £  7^2 ■  Then,  we  have 

EYy\z  >  T,Yy\x ,  (38) 

where  the  inequality  refers  to  the  order  of  self-adjoint  operators,  and  if  further  H2  is 
probability -determining,  the  following  equivalence  holds 

T^yy\x  =  TjYy\z  <?==^  YAW  |  Z.  (39) 

Proof  The  right  hand  side  of  Eq.  (39)  is  equivalent  to  py\x  =  Py\Zi  where  Py\x  an<i 
Py\z  are  the  conditional  probability  of  Y  given  X  and  given  Z ,  respectively.  Taking  the 
expectation  of  the  well-known  equality 

VY\ z[g(Y)  I  Z]  =  Ew]z  [VY\Ziw[g{Y)  I  Z,  W]]  +  Vw\z  [Ey\ZiW\g(Y)  \  Z,  W]]  (40) 

with  respect  to  Z,  we  derive 

Ez  [VY\ z[g(Y)  |  Z]\  =  EX  [VY\ x[g(Y)  \  X]]  +  Ez  [Vw\z[EY\x[g{Y)  \  X]]] .  (41) 

Since  the  last  term  of  Eq.  (41)  is  nonnegative,  we  obtain  Eq.  (38)  from  Proposition  5. 

Equality  holds  if  and  only  if  Vw\z[Ey\x[9(Y)  \  X]]  =  0  for  almost  every  Z,  which  means 
Ey\x[g(Y)  |  X]  does  not  depend  on  W  almost  surely.  This  is  equivalent  to 

EY]x[g(Y)\X\=EYlz[g(Y)\Z\  (42) 

for  almost  every  Z  and  W .  Because  H2  is  probability-determining,  this  means  PY\x  =  py\z- 
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A. 3  Conditional  cross-covariance  operator  and  conditional  independence 

Theorem  7  characterizes  conditional  independence  using  the  conditional  covariance  opera¬ 
tor.  Another  formulation  is  possible  with  a  conditional  cross-covariance  operator. 

Let  (Hi,  £>i),  (fi2,  £>2),  and  (^3,  £>3)  be  measurable  spaces,  and  let  (A,  Y,  Z )  be  a  random 
vector  on  Hi  x  VL2  x  O3  with  law  Pxyz ■  We  define  a  probability  measure  Ez[Px\z  ®  Py\z\ 
on  Hi  x  f^2  by 

Ez[Px\z®Py\z\{A  x  B)  =  Ez[Ex\z[xa\Z]  Ey\z[xb  \  Z]J,  (43) 

where  xa  is  the  characteristic  function  of  a  measurable  set  A.  It  is  canonically  extended  to 
any  product-measurable  sets  in  fi 1  x  Q2 ■ 

Theorem  8  Let  (fij,£>j)  (i  =  1,2,3 )  be  a  measurable  space,  let  ( Hi,ki )  be  a  RKHS  on  H^ 
with  kernel  measurable  and  bounded,  and  let  ( X ,  Y,  Z)  be  a  random  vector  on  Hi  x  U2  x  LI3. 
It  is  assumed  that  Ex\z[f(X)  \  Z]  and  EY\z[g(Y )  |  Z\  belong  to  H3  for  all  f  £  Tii  and 
g  £  H2,  and  that  Hi  <g>  H2  is  probability -determining.  Then,  we  have 

TYx\z  =  O  <=>  PXy  =  Ez[Px\z  Zi  Py\z}-  (44) 

Proof  The  right-to-left  direction  is  trivial  from  Theorem  5  and  the  definition  of  Ez[P\\z ® 
PY\z}-  The  left-hand  side  yields  Ez[Ex]z[f(X)  \  Z]EY\z[g{Y)  \  Z]]  =  EXY[f(X)g(Y )]  for 
all  /  £  Hi  and  g  £  H2-  By  the  definition  of  Hi  <8>  H2,  we  have  E(Xi  Yt^Q[h(X' ,  T7)]  = 
Exy[h{X,Y)\  for  all  h  £  Hi®H2,  where  Q  =  E z[PX\z® Py\z]-  This  implies  the  right-hand 
side,  because  Hi  <8>  H2  is  probability-determining.  ■ 


The  right-hand  side  of  Eq.  (44)  is  weaker  than  the  conditional  independence  of  X  and 
Y  given  Z .  However,  if  Z  is  a  part  of  X,  we  obtain  conditional  independence. 


Corollary  9  Let  ( Hn,ku ),  {H12,  £12);  and  (H2,  £2)  be  reproducing  kernel  Hilbert  spaces 
on  measurable  spaces  Hu,  il\2,  and  LI2,  respectively,  with  kernels  measurable  and  bounded. 
Let  (X,Y)  =  ( Z ,  W,  Y )  be  a  random  vector  on  Hu  x  H12  x  H2,  where  X  =  ( Z ,  W),  and  let 
Hi  =  Hh<S>Hi2  be  the  direct  product.  It  is  assumed  that  Ex\z[f{X)  \  Z]  and  EY\z[g{Y)  \  Z] 
belong  to  Hu  for  all  f  £  Hi  and  g  £  H2,  and  Hi  <g>  H2  is  probability- determining.  Then, 
we  have 

EyX| z  =  0  <=>  YALW  |  Z.  (45) 


Proof  For  any  measurable  sets  A  C  Hu,  B  C  fii2,  and  C  C  H2,  we  have,  in  general, 


Ez[Ex\z[xaxb(Z,W)  I  Z]Ey\z[xc(Y)  I  Z]\  -  Exy[XAxb(Z,W)xc(Y)} 

=  Ez[Ewiz[xb(W)  I  Z]xa{Z)Ey\z[xc{Y)  \  Z\]  -  Ez[Ewyiz[xb(W)xc(Y)  \  Z]Xa{Z)\ 


-  [  {Pw\z(B 
J  A 


z)PY\z(C  |  z)  -  PWY\z(B  x  C  |  z)}dPz(z). 


(46) 


From  Theorem  8,  the  left-hand  side  of  Eq.  (45)  is  equivalent  to  Ez[Px\z  <8>  Py\z]  =  P.xy , 
which  implies  that  the  last  integral  in  Eq.  (46)  is  zero  for  all  A.  This  means  P\y\z(B  \ 
z)PY\z{C  |  z)  —  P\Yy\ z ( B  x  C  |  z)  =  0  for  almost  every  z-PZ-  Thus,  Y  and  W  are 
conditional  independent  given  Z.  The  converse  is  trivial.  ■ 


Note  that  the  left-hand  side  of  Eq.  (45)  is  not  SYWjZ  but  Yyxjz,  which  is  defined  on  the 
direct  product  Hu  <S>H  12- 
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